{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "ecab9b6f",
   "metadata": {},
   "source": [
    "# Compression Techniques"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "cc408741",
   "metadata": {},
   "source": [
    "This notebook figures out what compression techniques are best suited for the database. It tries a varity of methods including csv, pickle and hdf. Based on these findings the compression technique is chosen. Here, the most important thing is **file size** given that every time someone access the database he is required to download the data file (unless stored locally). Therefore, I do not only test how long it takes for the files to get read in locally but also how long it would take to do so remotely.\n",
    "\n",
    "It uses the methodology as described here: https://towardsdatascience.com/still-saving-your-data-in-csv-try-these-other-options-9abe8b83db3a\n",
    "\n",
    "___\n",
    "The conclusion is that **Pickle (xz)** results in the most efficient loading. While it does have a significant write time, that is not a big deal since GitHub Actions does the conversion. However, to solve the vulnerability issue that arrises with loading with Pickles I've decided to take the next best thing, this is the **CSV BZ2** option which is about the same in terms of loading.\n",
    "___"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "d0599695",
   "metadata": {},
   "outputs": [],
   "source": [
    "import financedatabase as fd\n",
    "import os\n",
    "\n",
    "import warnings\n",
    "\n",
    "warnings.simplefilter(action=\"ignore\", category=FutureWarning)\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "warnings.simplefilter(action=\"ignore\", category=pd.errors.PerformanceWarning)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "094b3c22",
   "metadata": {},
   "outputs": [],
   "source": [
    "filename = \"compression/equities\"\n",
    "equities = fd.Equities()\n",
    "df = equities.data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "0d2a1608",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>method</th>\n",
       "      <th>file_size</th>\n",
       "      <th>write_time</th>\n",
       "      <th>read_time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [method, file_size, write_time, read_time]\n",
       "Index: []"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "df_results = pd.DataFrame(columns=[\"method\", \"file_size\", \"write_time\", \"read_time\"])\n",
    "display(df_results)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "7e648c28",
   "metadata": {},
   "source": [
    "## CSV"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "bff35dca",
   "metadata": {},
   "outputs": [],
   "source": [
    "def add_result(df_results, method, file_size, write_time, read_time):\n",
    "    row = {\n",
    "        \"method\": method,\n",
    "        \"file_size\": file_size,\n",
    "        \"write_time\": write_time,\n",
    "        \"read_time\": read_time,\n",
    "    }\n",
    "    return pd.concat([df_results, pd.DataFrame([row])], axis=0, ignore_index=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "8689ab07",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.2 s ± 9.24 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n",
      "637 ms ± 3.11 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>method</th>\n",
       "      <th>file_size</th>\n",
       "      <th>write_time</th>\n",
       "      <th>read_time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CSV</td>\n",
       "      <td>69.3169</td>\n",
       "      <td>1.19862</td>\n",
       "      <td>0.636521</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  method  file_size  write_time  read_time\n",
       "0    CSV    69.3169     1.19862   0.636521"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# ---saving---\n",
    "result_save = %timeit -n5 -r5 -o df.to_csv(filename + '_csv.csv')\n",
    "# ---get the size of file---\n",
    "filesize = os.path.getsize(filename + \"_csv.csv\") / 1024**2\n",
    "# ---load---\n",
    "result_read = %timeit -n5 -r5 -o pd.read_csv(filename + '_csv.csv')\n",
    "# ---save the result to the dataframe---\n",
    "df_results = add_result(\n",
    "    df_results, \"CSV\", filesize, result_save.average, result_read.average\n",
    ")\n",
    "df_results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "77424bda",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.17 s ± 17.4 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n",
      "608 ms ± 6.45 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>method</th>\n",
       "      <th>file_size</th>\n",
       "      <th>write_time</th>\n",
       "      <th>read_time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CSV</td>\n",
       "      <td>69.31690</td>\n",
       "      <td>1.198620</td>\n",
       "      <td>0.636521</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>CSV No Index</td>\n",
       "      <td>67.99397</td>\n",
       "      <td>1.172448</td>\n",
       "      <td>0.608208</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         method  file_size  write_time  read_time\n",
       "0           CSV   69.31690    1.198620   0.636521\n",
       "1  CSV No Index   67.99397    1.172448   0.608208"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#---saving---\n",
    "result_save = %timeit -n5 -r5 -o df.to_csv(filename + 'no_index_csv.csv', \\\n",
    "                                           index=False)\n",
    "#---get the size of file---\n",
    "filesize = os.path.getsize(filename + '_no_index_csv.csv') / 1024**2\n",
    "#---load---\n",
    "result_read = %timeit -n5 -r5 -o pd.read_csv(filename + 'no_index_csv.csv')\n",
    "#---save the result to the dataframe---\n",
    "df_results = add_result(df_results,\n",
    "                        'CSV No Index',\n",
    "                        filesize, \n",
    "                        result_save.average, \n",
    "                        result_read.average)\n",
    "df_results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "4b9ef03e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3.14 s ± 30.5 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n",
      "830 ms ± 8.81 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>method</th>\n",
       "      <th>file_size</th>\n",
       "      <th>write_time</th>\n",
       "      <th>read_time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CSV</td>\n",
       "      <td>69.31690</td>\n",
       "      <td>1.198620</td>\n",
       "      <td>0.636521</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>CSV No Index</td>\n",
       "      <td>67.99397</td>\n",
       "      <td>1.172448</td>\n",
       "      <td>0.608208</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>CSV No Index (GZIP)</td>\n",
       "      <td>16.48877</td>\n",
       "      <td>3.143661</td>\n",
       "      <td>0.830191</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                method  file_size  write_time  read_time\n",
       "0                  CSV   69.31690    1.198620   0.636521\n",
       "1         CSV No Index   67.99397    1.172448   0.608208\n",
       "2  CSV No Index (GZIP)   16.48877    3.143661   0.830191"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#---saving---\n",
    "result_save = %timeit -n5 -r5 -o df.to_csv(filename + 'no_index_csv.gzip', \\\n",
    "                                           index=False, \\\n",
    "                                           compression='gzip')\n",
    "#---get the size of file---\n",
    "filesize = os.path.getsize(filename + '_no_index_csv.gzip') / 1024**2\n",
    "#---load---\n",
    "result_read = %timeit -n5 -r5 -o pd.read_csv(filename + 'no_index_csv.gzip', \\\n",
    "                                             compression='gzip')\n",
    "#---save the result to the dataframe---\n",
    "df_results = add_result(df_results,\n",
    "                        'CSV No Index (GZIP)',\n",
    "                        filesize, \n",
    "                        result_save.average, \n",
    "                        result_read.average)\n",
    "df_results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "64e7b820",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "6.12 s ± 58.4 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n",
      "2.07 s ± 108 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>method</th>\n",
       "      <th>file_size</th>\n",
       "      <th>write_time</th>\n",
       "      <th>read_time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CSV</td>\n",
       "      <td>69.316900</td>\n",
       "      <td>1.198620</td>\n",
       "      <td>0.636521</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>CSV No Index</td>\n",
       "      <td>67.993970</td>\n",
       "      <td>1.172448</td>\n",
       "      <td>0.608208</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>CSV No Index (GZIP)</td>\n",
       "      <td>16.488770</td>\n",
       "      <td>3.143661</td>\n",
       "      <td>0.830191</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>CSV No Index (BZ2)</td>\n",
       "      <td>11.744003</td>\n",
       "      <td>6.116914</td>\n",
       "      <td>2.072585</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                method  file_size  write_time  read_time\n",
       "0                  CSV  69.316900    1.198620   0.636521\n",
       "1         CSV No Index  67.993970    1.172448   0.608208\n",
       "2  CSV No Index (GZIP)  16.488770    3.143661   0.830191\n",
       "3   CSV No Index (BZ2)  11.744003    6.116914   2.072585"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#---saving---\n",
    "result_save = %timeit -n5 -r5 -o df.to_csv(filename + 'no_index_csv.bz2', \\\n",
    "                                           index=False, \\\n",
    "                                           compression='bz2')\n",
    "#---get the size of file---\n",
    "filesize = os.path.getsize(filename + '_no_index_csv.bz2') / 1024**2\n",
    "#---load---\n",
    "result_read = %timeit -n5 -r5 -o pd.read_csv(filename + 'no_index_csv.bz2', \\\n",
    "                                             compression='bz2')\n",
    "#---save the result to the dataframe---\n",
    "df_results = add_result(df_results,\n",
    "                        'CSV No Index (BZ2)',\n",
    "                        filesize, \n",
    "                        result_save.average, \n",
    "                        result_read.average)\n",
    "df_results"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "cb8a7f38",
   "metadata": {},
   "source": [
    "## Pickle"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "d30d2e2b",
   "metadata": {},
   "source": [
    "### Base"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "a0fcfaf7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "112 ms ± 2.99 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n",
      "137 ms ± 1.6 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>method</th>\n",
       "      <th>file_size</th>\n",
       "      <th>write_time</th>\n",
       "      <th>read_time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CSV</td>\n",
       "      <td>69.316900</td>\n",
       "      <td>1.198620</td>\n",
       "      <td>0.636521</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>CSV No Index</td>\n",
       "      <td>67.993970</td>\n",
       "      <td>1.172448</td>\n",
       "      <td>0.608208</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>CSV No Index (GZIP)</td>\n",
       "      <td>16.488770</td>\n",
       "      <td>3.143661</td>\n",
       "      <td>0.830191</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>CSV No Index (BZ2)</td>\n",
       "      <td>11.744003</td>\n",
       "      <td>6.116914</td>\n",
       "      <td>2.072585</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Pickle</td>\n",
       "      <td>56.924569</td>\n",
       "      <td>0.112318</td>\n",
       "      <td>0.136508</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                method  file_size  write_time  read_time\n",
       "0                  CSV  69.316900    1.198620   0.636521\n",
       "1         CSV No Index  67.993970    1.172448   0.608208\n",
       "2  CSV No Index (GZIP)  16.488770    3.143661   0.830191\n",
       "3   CSV No Index (BZ2)  11.744003    6.116914   2.072585\n",
       "4               Pickle  56.924569    0.112318   0.136508"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# ---saving---\n",
    "result_save = %timeit -n5 -r5 -o df.to_pickle(filename + '.pkl')\n",
    "# ---get the size of file---\n",
    "filesize = os.path.getsize(filename + \".pkl\") / 1024**2\n",
    "# ---load---\n",
    "result_read = %timeit -n5 -r5 -o pd.read_pickle(filename + '.pkl')\n",
    "# ---save the result to the dataframe---\n",
    "df_results = add_result(\n",
    "    df_results, \"Pickle\", filesize, result_save.average, result_read.average\n",
    ")\n",
    "df_results"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "13a1fa61",
   "metadata": {},
   "source": [
    "### GZIP"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "39c83c55",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3.74 s ± 43.5 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n",
      "330 ms ± 1.98 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>method</th>\n",
       "      <th>file_size</th>\n",
       "      <th>write_time</th>\n",
       "      <th>read_time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CSV</td>\n",
       "      <td>69.316900</td>\n",
       "      <td>1.198620</td>\n",
       "      <td>0.636521</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>CSV No Index</td>\n",
       "      <td>67.993970</td>\n",
       "      <td>1.172448</td>\n",
       "      <td>0.608208</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>CSV No Index (GZIP)</td>\n",
       "      <td>16.488770</td>\n",
       "      <td>3.143661</td>\n",
       "      <td>0.830191</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>CSV No Index (BZ2)</td>\n",
       "      <td>11.744003</td>\n",
       "      <td>6.116914</td>\n",
       "      <td>2.072585</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Pickle</td>\n",
       "      <td>56.924569</td>\n",
       "      <td>0.112318</td>\n",
       "      <td>0.136508</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Pickle (GZIP)</td>\n",
       "      <td>14.331237</td>\n",
       "      <td>3.735072</td>\n",
       "      <td>0.330075</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                method  file_size  write_time  read_time\n",
       "0                  CSV  69.316900    1.198620   0.636521\n",
       "1         CSV No Index  67.993970    1.172448   0.608208\n",
       "2  CSV No Index (GZIP)  16.488770    3.143661   0.830191\n",
       "3   CSV No Index (BZ2)  11.744003    6.116914   2.072585\n",
       "4               Pickle  56.924569    0.112318   0.136508\n",
       "5        Pickle (GZIP)  14.331237    3.735072   0.330075"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#---saving---\n",
    "result_save = %timeit -n5 -r5 -o df.to_pickle(filename + '_gzip.pkl', \\\n",
    "                                              compression='gzip')\n",
    "#---get the size of file---\n",
    "filesize = os.path.getsize(filename + '_gzip.pkl') / 1024**2\n",
    "#---load---\n",
    "result_read = %timeit -n5 -r5 -o pd.read_pickle(filename + '_gzip.pkl', \\\n",
    "                                                compression='gzip')\n",
    "#---save the result to the dataframe---\n",
    "df_results = add_result(df_results,\n",
    "                        'Pickle (GZIP)',\n",
    "                        filesize, \n",
    "                        result_save.average, \n",
    "                        result_read.average)\n",
    "df_results"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "d7b11ae4",
   "metadata": {},
   "source": [
    "### BZ2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "e490687e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7.17 s ± 35.2 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n",
      "1.25 s ± 15.3 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>method</th>\n",
       "      <th>file_size</th>\n",
       "      <th>write_time</th>\n",
       "      <th>read_time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CSV</td>\n",
       "      <td>69.316900</td>\n",
       "      <td>1.198620</td>\n",
       "      <td>0.636521</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>CSV No Index</td>\n",
       "      <td>67.993970</td>\n",
       "      <td>1.172448</td>\n",
       "      <td>0.608208</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>CSV No Index (GZIP)</td>\n",
       "      <td>16.488770</td>\n",
       "      <td>3.143661</td>\n",
       "      <td>0.830191</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>CSV No Index (BZ2)</td>\n",
       "      <td>11.744003</td>\n",
       "      <td>6.116914</td>\n",
       "      <td>2.072585</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Pickle</td>\n",
       "      <td>56.924569</td>\n",
       "      <td>0.112318</td>\n",
       "      <td>0.136508</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Pickle (GZIP)</td>\n",
       "      <td>14.331237</td>\n",
       "      <td>3.735072</td>\n",
       "      <td>0.330075</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Pickle (BZ2)</td>\n",
       "      <td>10.235255</td>\n",
       "      <td>7.170931</td>\n",
       "      <td>1.253400</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                method  file_size  write_time  read_time\n",
       "0                  CSV  69.316900    1.198620   0.636521\n",
       "1         CSV No Index  67.993970    1.172448   0.608208\n",
       "2  CSV No Index (GZIP)  16.488770    3.143661   0.830191\n",
       "3   CSV No Index (BZ2)  11.744003    6.116914   2.072585\n",
       "4               Pickle  56.924569    0.112318   0.136508\n",
       "5        Pickle (GZIP)  14.331237    3.735072   0.330075\n",
       "6         Pickle (BZ2)  10.235255    7.170931   1.253400"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#---saving---\n",
    "result_save = %timeit -n5 -r5 -o df.to_pickle(filename + '_bz2.pkl', \\\n",
    "                                              compression='bz2')\n",
    "#---get the size of file---\n",
    "filesize = os.path.getsize(filename + '_bz2.pkl') / 1024**2\n",
    "#---load---\n",
    "result_read = %timeit -n5 -r5 -o pd.read_pickle(filename + '_bz2.pkl', \\\n",
    "                                                compression='bz2')\n",
    "#---save the result to the dataframe---\n",
    "df_results = add_result(df_results,\n",
    "                        'Pickle (BZ2)',\n",
    "                        filesize, \n",
    "                        result_save.average, \n",
    "                        result_read.average)\n",
    "df_results"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "ee3456f4",
   "metadata": {},
   "source": [
    "### ZIP"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "2e147980",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.49 s ± 10.8 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n",
      "346 ms ± 1.8 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>method</th>\n",
       "      <th>file_size</th>\n",
       "      <th>write_time</th>\n",
       "      <th>read_time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CSV</td>\n",
       "      <td>69.316900</td>\n",
       "      <td>1.198620</td>\n",
       "      <td>0.636521</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>CSV No Index</td>\n",
       "      <td>67.993970</td>\n",
       "      <td>1.172448</td>\n",
       "      <td>0.608208</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>CSV No Index (GZIP)</td>\n",
       "      <td>16.488770</td>\n",
       "      <td>3.143661</td>\n",
       "      <td>0.830191</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>CSV No Index (BZ2)</td>\n",
       "      <td>11.744003</td>\n",
       "      <td>6.116914</td>\n",
       "      <td>2.072585</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Pickle</td>\n",
       "      <td>56.924569</td>\n",
       "      <td>0.112318</td>\n",
       "      <td>0.136508</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Pickle (GZIP)</td>\n",
       "      <td>14.331237</td>\n",
       "      <td>3.735072</td>\n",
       "      <td>0.330075</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Pickle (BZ2)</td>\n",
       "      <td>10.235255</td>\n",
       "      <td>7.170931</td>\n",
       "      <td>1.253400</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Pickle (ZIP)</td>\n",
       "      <td>14.490427</td>\n",
       "      <td>1.492701</td>\n",
       "      <td>0.345828</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                method  file_size  write_time  read_time\n",
       "0                  CSV  69.316900    1.198620   0.636521\n",
       "1         CSV No Index  67.993970    1.172448   0.608208\n",
       "2  CSV No Index (GZIP)  16.488770    3.143661   0.830191\n",
       "3   CSV No Index (BZ2)  11.744003    6.116914   2.072585\n",
       "4               Pickle  56.924569    0.112318   0.136508\n",
       "5        Pickle (GZIP)  14.331237    3.735072   0.330075\n",
       "6         Pickle (BZ2)  10.235255    7.170931   1.253400\n",
       "7         Pickle (ZIP)  14.490427    1.492701   0.345828"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#---saving---\n",
    "result_save = %timeit -n5 -r5 -o df.to_pickle(filename + '_zip.pkl', \\\n",
    "                                              compression='zip')\n",
    "#---get the size of file---\n",
    "filesize = os.path.getsize(filename + '_zip.pkl') / 1024**2\n",
    "#---load---\n",
    "result_read = %timeit -n5 -r5 -o pd.read_pickle(filename + '_zip.pkl', \\\n",
    "                                                compression='zip')\n",
    "#---save the result to the dataframe---\n",
    "df_results = add_result(df_results,\n",
    "                        'Pickle (ZIP)',\n",
    "                        filesize, \n",
    "                        result_save.average, \n",
    "                        result_read.average)\n",
    "df_results"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "962b5c92",
   "metadata": {},
   "source": [
    "### XZ"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "f6fdbd93",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "22.1 s ± 756 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n",
      "682 ms ± 7.35 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>method</th>\n",
       "      <th>file_size</th>\n",
       "      <th>write_time</th>\n",
       "      <th>read_time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CSV</td>\n",
       "      <td>69.316900</td>\n",
       "      <td>1.198620</td>\n",
       "      <td>0.636521</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>CSV No Index</td>\n",
       "      <td>67.993970</td>\n",
       "      <td>1.172448</td>\n",
       "      <td>0.608208</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>CSV No Index (GZIP)</td>\n",
       "      <td>16.488770</td>\n",
       "      <td>3.143661</td>\n",
       "      <td>0.830191</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>CSV No Index (BZ2)</td>\n",
       "      <td>11.744003</td>\n",
       "      <td>6.116914</td>\n",
       "      <td>2.072585</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Pickle</td>\n",
       "      <td>56.924569</td>\n",
       "      <td>0.112318</td>\n",
       "      <td>0.136508</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Pickle (GZIP)</td>\n",
       "      <td>14.331237</td>\n",
       "      <td>3.735072</td>\n",
       "      <td>0.330075</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Pickle (BZ2)</td>\n",
       "      <td>10.235255</td>\n",
       "      <td>7.170931</td>\n",
       "      <td>1.253400</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Pickle (ZIP)</td>\n",
       "      <td>14.490427</td>\n",
       "      <td>1.492701</td>\n",
       "      <td>0.345828</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Pickle (xz)</td>\n",
       "      <td>9.435333</td>\n",
       "      <td>22.078919</td>\n",
       "      <td>0.681835</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                method  file_size  write_time  read_time\n",
       "0                  CSV  69.316900    1.198620   0.636521\n",
       "1         CSV No Index  67.993970    1.172448   0.608208\n",
       "2  CSV No Index (GZIP)  16.488770    3.143661   0.830191\n",
       "3   CSV No Index (BZ2)  11.744003    6.116914   2.072585\n",
       "4               Pickle  56.924569    0.112318   0.136508\n",
       "5        Pickle (GZIP)  14.331237    3.735072   0.330075\n",
       "6         Pickle (BZ2)  10.235255    7.170931   1.253400\n",
       "7         Pickle (ZIP)  14.490427    1.492701   0.345828\n",
       "8          Pickle (xz)   9.435333   22.078919   0.681835"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#---saving---\n",
    "result_save = %timeit -n5 -r5 -o df.to_pickle(filename + '_xz.pkl', \\\n",
    "                                              compression='xz')\n",
    "#---get the size of file---\n",
    "filesize = os.path.getsize(filename + '_xz.pkl') / 1024**2\n",
    "#---load---\n",
    "result_read = %timeit -n5 -r5 -o pd.read_pickle(filename + '_xz.pkl', \\\n",
    "                                                compression='xz')\n",
    "#---save the result to the dataframe---\n",
    "df_results = add_result(df_results,\n",
    "                        'Pickle (xz)',\n",
    "                        filesize, \n",
    "                        result_save.average, \n",
    "                        result_read.average)\n",
    "df_results"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "a612764b",
   "metadata": {},
   "source": [
    "## HDF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "af8c506d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "213 ms ± 29.3 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n",
      "248 ms ± 8.35 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>method</th>\n",
       "      <th>file_size</th>\n",
       "      <th>write_time</th>\n",
       "      <th>read_time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CSV</td>\n",
       "      <td>69.316900</td>\n",
       "      <td>1.198620</td>\n",
       "      <td>0.636521</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>CSV No Index</td>\n",
       "      <td>67.993970</td>\n",
       "      <td>1.172448</td>\n",
       "      <td>0.608208</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>CSV No Index (GZIP)</td>\n",
       "      <td>16.488770</td>\n",
       "      <td>3.143661</td>\n",
       "      <td>0.830191</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>CSV No Index (BZ2)</td>\n",
       "      <td>11.744003</td>\n",
       "      <td>6.116914</td>\n",
       "      <td>2.072585</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Pickle</td>\n",
       "      <td>56.924569</td>\n",
       "      <td>0.112318</td>\n",
       "      <td>0.136508</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Pickle (GZIP)</td>\n",
       "      <td>14.331237</td>\n",
       "      <td>3.735072</td>\n",
       "      <td>0.330075</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Pickle (BZ2)</td>\n",
       "      <td>10.235255</td>\n",
       "      <td>7.170931</td>\n",
       "      <td>1.253400</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Pickle (ZIP)</td>\n",
       "      <td>14.490427</td>\n",
       "      <td>1.492701</td>\n",
       "      <td>0.345828</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Pickle (xz)</td>\n",
       "      <td>9.435333</td>\n",
       "      <td>22.078919</td>\n",
       "      <td>0.681835</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>HDF</td>\n",
       "      <td>59.208774</td>\n",
       "      <td>0.213282</td>\n",
       "      <td>0.248143</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                method  file_size  write_time  read_time\n",
       "0                  CSV  69.316900    1.198620   0.636521\n",
       "1         CSV No Index  67.993970    1.172448   0.608208\n",
       "2  CSV No Index (GZIP)  16.488770    3.143661   0.830191\n",
       "3   CSV No Index (BZ2)  11.744003    6.116914   2.072585\n",
       "4               Pickle  56.924569    0.112318   0.136508\n",
       "5        Pickle (GZIP)  14.331237    3.735072   0.330075\n",
       "6         Pickle (BZ2)  10.235255    7.170931   1.253400\n",
       "7         Pickle (ZIP)  14.490427    1.492701   0.345828\n",
       "8          Pickle (xz)   9.435333   22.078919   0.681835\n",
       "9                  HDF  59.208774    0.213282   0.248143"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#---saving---\n",
    "result_save = %timeit -n5 -r5 -o df.to_hdf(filename + '.h5', \\\n",
    "                                           key='key', \\\n",
    "                                           mode='w')\n",
    "#---get the size of file---\n",
    "filesize = os.path.getsize(filename + '.h5') / 1024**2\n",
    "#---load---\n",
    "result_read = %timeit -n5 -r5 -o pd.read_hdf(filename + '.h5', \\\n",
    "                                             key='key', \\\n",
    "                                             mode='r')\n",
    "#---save the result to the dataframe---\n",
    "df_results = add_result(df_results,\n",
    "                        'HDF',\n",
    "                        filesize, \n",
    "                        result_save.average, \n",
    "                        result_read.average)\n",
    "df_results"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "92ec334e",
   "metadata": {},
   "source": [
    "## Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "fe5b848c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>method</th>\n",
       "      <th>file_size</th>\n",
       "      <th>write_time</th>\n",
       "      <th>read_time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Pickle</td>\n",
       "      <td>56.924569</td>\n",
       "      <td>0.112318</td>\n",
       "      <td>0.136508</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>HDF</td>\n",
       "      <td>59.208774</td>\n",
       "      <td>0.213282</td>\n",
       "      <td>0.248143</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Pickle (GZIP)</td>\n",
       "      <td>14.331237</td>\n",
       "      <td>3.735072</td>\n",
       "      <td>0.330075</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Pickle (ZIP)</td>\n",
       "      <td>14.490427</td>\n",
       "      <td>1.492701</td>\n",
       "      <td>0.345828</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>CSV No Index</td>\n",
       "      <td>67.993970</td>\n",
       "      <td>1.172448</td>\n",
       "      <td>0.608208</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CSV</td>\n",
       "      <td>69.316900</td>\n",
       "      <td>1.198620</td>\n",
       "      <td>0.636521</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Pickle (xz)</td>\n",
       "      <td>9.435333</td>\n",
       "      <td>22.078919</td>\n",
       "      <td>0.681835</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>CSV No Index (GZIP)</td>\n",
       "      <td>16.488770</td>\n",
       "      <td>3.143661</td>\n",
       "      <td>0.830191</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Pickle (BZ2)</td>\n",
       "      <td>10.235255</td>\n",
       "      <td>7.170931</td>\n",
       "      <td>1.253400</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>CSV No Index (BZ2)</td>\n",
       "      <td>11.744003</td>\n",
       "      <td>6.116914</td>\n",
       "      <td>2.072585</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                method  file_size  write_time  read_time\n",
       "4               Pickle  56.924569    0.112318   0.136508\n",
       "9                  HDF  59.208774    0.213282   0.248143\n",
       "5        Pickle (GZIP)  14.331237    3.735072   0.330075\n",
       "7         Pickle (ZIP)  14.490427    1.492701   0.345828\n",
       "1         CSV No Index  67.993970    1.172448   0.608208\n",
       "0                  CSV  69.316900    1.198620   0.636521\n",
       "8          Pickle (xz)   9.435333   22.078919   0.681835\n",
       "2  CSV No Index (GZIP)  16.488770    3.143661   0.830191\n",
       "6         Pickle (BZ2)  10.235255    7.170931   1.253400\n",
       "3   CSV No Index (BZ2)  11.744003    6.116914   2.072585"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_results.sort_values(by=\"read_time\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "ccd1d106",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>method</th>\n",
       "      <th>file_size</th>\n",
       "      <th>write_time</th>\n",
       "      <th>read_time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Pickle</td>\n",
       "      <td>56.924569</td>\n",
       "      <td>0.112318</td>\n",
       "      <td>0.136508</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>HDF</td>\n",
       "      <td>59.208774</td>\n",
       "      <td>0.213282</td>\n",
       "      <td>0.248143</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>CSV No Index</td>\n",
       "      <td>67.993970</td>\n",
       "      <td>1.172448</td>\n",
       "      <td>0.608208</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CSV</td>\n",
       "      <td>69.316900</td>\n",
       "      <td>1.198620</td>\n",
       "      <td>0.636521</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Pickle (ZIP)</td>\n",
       "      <td>14.490427</td>\n",
       "      <td>1.492701</td>\n",
       "      <td>0.345828</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>CSV No Index (GZIP)</td>\n",
       "      <td>16.488770</td>\n",
       "      <td>3.143661</td>\n",
       "      <td>0.830191</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Pickle (GZIP)</td>\n",
       "      <td>14.331237</td>\n",
       "      <td>3.735072</td>\n",
       "      <td>0.330075</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>CSV No Index (BZ2)</td>\n",
       "      <td>11.744003</td>\n",
       "      <td>6.116914</td>\n",
       "      <td>2.072585</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Pickle (BZ2)</td>\n",
       "      <td>10.235255</td>\n",
       "      <td>7.170931</td>\n",
       "      <td>1.253400</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Pickle (xz)</td>\n",
       "      <td>9.435333</td>\n",
       "      <td>22.078919</td>\n",
       "      <td>0.681835</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                method  file_size  write_time  read_time\n",
       "4               Pickle  56.924569    0.112318   0.136508\n",
       "9                  HDF  59.208774    0.213282   0.248143\n",
       "1         CSV No Index  67.993970    1.172448   0.608208\n",
       "0                  CSV  69.316900    1.198620   0.636521\n",
       "7         Pickle (ZIP)  14.490427    1.492701   0.345828\n",
       "2  CSV No Index (GZIP)  16.488770    3.143661   0.830191\n",
       "5        Pickle (GZIP)  14.331237    3.735072   0.330075\n",
       "3   CSV No Index (BZ2)  11.744003    6.116914   2.072585\n",
       "6         Pickle (BZ2)  10.235255    7.170931   1.253400\n",
       "8          Pickle (xz)   9.435333   22.078919   0.681835"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_results.sort_values(by=\"write_time\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "df71b780",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>method</th>\n",
       "      <th>file_size</th>\n",
       "      <th>write_time</th>\n",
       "      <th>read_time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Pickle (xz)</td>\n",
       "      <td>9.435333</td>\n",
       "      <td>22.078919</td>\n",
       "      <td>0.681835</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Pickle (BZ2)</td>\n",
       "      <td>10.235255</td>\n",
       "      <td>7.170931</td>\n",
       "      <td>1.253400</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>CSV No Index (BZ2)</td>\n",
       "      <td>11.744003</td>\n",
       "      <td>6.116914</td>\n",
       "      <td>2.072585</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Pickle (GZIP)</td>\n",
       "      <td>14.331237</td>\n",
       "      <td>3.735072</td>\n",
       "      <td>0.330075</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Pickle (ZIP)</td>\n",
       "      <td>14.490427</td>\n",
       "      <td>1.492701</td>\n",
       "      <td>0.345828</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>CSV No Index (GZIP)</td>\n",
       "      <td>16.488770</td>\n",
       "      <td>3.143661</td>\n",
       "      <td>0.830191</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Pickle</td>\n",
       "      <td>56.924569</td>\n",
       "      <td>0.112318</td>\n",
       "      <td>0.136508</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>HDF</td>\n",
       "      <td>59.208774</td>\n",
       "      <td>0.213282</td>\n",
       "      <td>0.248143</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>CSV No Index</td>\n",
       "      <td>67.993970</td>\n",
       "      <td>1.172448</td>\n",
       "      <td>0.608208</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CSV</td>\n",
       "      <td>69.316900</td>\n",
       "      <td>1.198620</td>\n",
       "      <td>0.636521</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                method  file_size  write_time  read_time\n",
       "8          Pickle (xz)   9.435333   22.078919   0.681835\n",
       "6         Pickle (BZ2)  10.235255    7.170931   1.253400\n",
       "3   CSV No Index (BZ2)  11.744003    6.116914   2.072585\n",
       "5        Pickle (GZIP)  14.331237    3.735072   0.330075\n",
       "7         Pickle (ZIP)  14.490427    1.492701   0.345828\n",
       "2  CSV No Index (GZIP)  16.488770    3.143661   0.830191\n",
       "4               Pickle  56.924569    0.112318   0.136508\n",
       "9                  HDF  59.208774    0.213282   0.248143\n",
       "1         CSV No Index  67.993970    1.172448   0.608208\n",
       "0                  CSV  69.316900    1.198620   0.636521"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_results.sort_values(by=\"file_size\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "cb094c14",
   "metadata": {},
   "source": [
    "## Remote Validation"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "088c2d3d",
   "metadata": {},
   "source": [
    "This can only be ran once the files are pushed to GitHub. This is done outside of this Jupyter Notebook. Note that running this part won't work since I picked one and removed the others from the repository and these statistics are merely to explain why I picked the compression method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "66921a13",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>method</th>\n",
       "      <th>read_time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [method, read_time]\n",
       "Index: []"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "repository_path = (\n",
    "    \"https://raw.githubusercontent.com/JerBouma/FinanceDatabase/main/compression/\"\n",
    ")\n",
    "filename = \"equities\"\n",
    "\n",
    "df_results = pd.DataFrame(columns=[\"method\", \"read_time\"])\n",
    "display(df_results)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "311dd4aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "def add_result(df_results, method, read_time):\n",
    "    row = {\"method\": method, \"read_time\": read_time}\n",
    "    return pd.concat([df_results, pd.DataFrame([row])], axis=0, ignore_index=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "7a227e87",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10.3 s ± 1.03 s per loop (mean ± std. dev. of 5 runs, 5 loops each)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>method</th>\n",
       "      <th>read_time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CSV</td>\n",
       "      <td>10.281532</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  method  read_time\n",
       "0    CSV  10.281532"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# ---load---\n",
    "result_read = %timeit -n5 -r5 -o pd.read_csv(repository_path + filename + '_csv.csv')\n",
    "# ---save the result to the dataframe---\n",
    "df_results = add_result(df_results, \"CSV\", result_read.average)\n",
    "\n",
    "df_results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "bdbfa9da",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7.36 s ± 644 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>method</th>\n",
       "      <th>read_time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CSV</td>\n",
       "      <td>10.281532</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>CSV No Index</td>\n",
       "      <td>7.362340</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         method  read_time\n",
       "0           CSV  10.281532\n",
       "1  CSV No Index   7.362340"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# ---load---\n",
    "result_read = (\n",
    "    %timeit -n5 -r5 -o pd.read_csv(repository_path + filename + '_no_index_csv.csv')\n",
    ")\n",
    "# ---save the result to the dataframe---\n",
    "df_results = add_result(df_results, \"CSV No Index\", result_read.average)\n",
    "\n",
    "df_results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "bcff37a1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2.47 s ± 116 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>method</th>\n",
       "      <th>read_time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CSV</td>\n",
       "      <td>10.281532</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>CSV No Index</td>\n",
       "      <td>7.362340</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>CSV No Index (GZIP)</td>\n",
       "      <td>2.472006</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                method  read_time\n",
       "0                  CSV  10.281532\n",
       "1         CSV No Index   7.362340\n",
       "2  CSV No Index (GZIP)   2.472006"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#---load---\n",
    "result_read = %timeit -n5 -r5 -o pd.read_csv(repository_path + filename + '_no_index_csv.gzip', \\\n",
    "                                             compression='gzip')\n",
    "#---save the result to the dataframe---\n",
    "df_results = add_result(df_results,\n",
    "                        'CSV No Index (GZIP)', \n",
    "                        result_read.average)\n",
    "\n",
    "df_results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "f342af8d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3.59 s ± 215 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>method</th>\n",
       "      <th>read_time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CSV</td>\n",
       "      <td>10.281532</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>CSV No Index</td>\n",
       "      <td>7.362340</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>CSV No Index (GZIP)</td>\n",
       "      <td>2.472006</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>CSV No Index (BZ2)</td>\n",
       "      <td>3.592286</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                method  read_time\n",
       "0                  CSV  10.281532\n",
       "1         CSV No Index   7.362340\n",
       "2  CSV No Index (GZIP)   2.472006\n",
       "3   CSV No Index (BZ2)   3.592286"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#---load---\n",
    "result_read = %timeit -n5 -r5 -o pd.read_csv(repository_path + filename + '_no_index_csv.bz2', \\\n",
    "                                             compression='bz2')\n",
    "#---save the result to the dataframe---\n",
    "df_results = add_result(df_results,\n",
    "                        'CSV No Index (BZ2)', \n",
    "                        result_read.average)\n",
    "\n",
    "df_results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "7e144c61",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2.32 s ± 139 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>method</th>\n",
       "      <th>read_time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CSV</td>\n",
       "      <td>10.281532</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>CSV No Index</td>\n",
       "      <td>7.362340</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>CSV No Index (GZIP)</td>\n",
       "      <td>2.472006</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>CSV No Index (BZ2)</td>\n",
       "      <td>3.592286</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Pickle (GZIP)</td>\n",
       "      <td>2.319007</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                method  read_time\n",
       "0                  CSV  10.281532\n",
       "1         CSV No Index   7.362340\n",
       "2  CSV No Index (GZIP)   2.472006\n",
       "3   CSV No Index (BZ2)   3.592286\n",
       "4        Pickle (GZIP)   2.319007"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#---load---\n",
    "result_read = %timeit -n5 -r5 -o pd.read_pickle(repository_path + filename + '_gzip.pkl', \\\n",
    "                                                compression='gzip')\n",
    "#---save the result to the dataframe---\n",
    "df_results = add_result(df_results,\n",
    "                        'Pickle (GZIP)', \n",
    "                        result_read.average)\n",
    "\n",
    "df_results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "fdf3fac0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2.82 s ± 82.1 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>method</th>\n",
       "      <th>read_time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CSV</td>\n",
       "      <td>10.281532</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>CSV No Index</td>\n",
       "      <td>7.362340</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>CSV No Index (GZIP)</td>\n",
       "      <td>2.472006</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>CSV No Index (BZ2)</td>\n",
       "      <td>3.592286</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Pickle (GZIP)</td>\n",
       "      <td>2.319007</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Pickle (BZ2)</td>\n",
       "      <td>2.821927</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                method  read_time\n",
       "0                  CSV  10.281532\n",
       "1         CSV No Index   7.362340\n",
       "2  CSV No Index (GZIP)   2.472006\n",
       "3   CSV No Index (BZ2)   3.592286\n",
       "4        Pickle (GZIP)   2.319007\n",
       "5         Pickle (BZ2)   2.821927"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#---load---\n",
    "result_read = %timeit -n5 -r5 -o pd.read_pickle(repository_path + filename + '_bz2.pkl', \\\n",
    "                                                compression='bz2')\n",
    "#---save the result to the dataframe---\n",
    "df_results = add_result(df_results,\n",
    "                        'Pickle (BZ2)',\n",
    "                        result_read.average)\n",
    "\n",
    "df_results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "5316227c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2.62 s ± 58.4 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>method</th>\n",
       "      <th>read_time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CSV</td>\n",
       "      <td>10.281532</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>CSV No Index</td>\n",
       "      <td>7.362340</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>CSV No Index (GZIP)</td>\n",
       "      <td>2.472006</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>CSV No Index (BZ2)</td>\n",
       "      <td>3.592286</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Pickle (GZIP)</td>\n",
       "      <td>2.319007</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Pickle (BZ2)</td>\n",
       "      <td>2.821927</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Pickle (ZIP)</td>\n",
       "      <td>2.623271</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                method  read_time\n",
       "0                  CSV  10.281532\n",
       "1         CSV No Index   7.362340\n",
       "2  CSV No Index (GZIP)   2.472006\n",
       "3   CSV No Index (BZ2)   3.592286\n",
       "4        Pickle (GZIP)   2.319007\n",
       "5         Pickle (BZ2)   2.821927\n",
       "6         Pickle (ZIP)   2.623271"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#---load---\n",
    "result_read = %timeit -n5 -r5 -o pd.read_pickle(repository_path + filename + '_zip.pkl', \\\n",
    "                                                compression='zip')\n",
    "#---save the result to the dataframe---\n",
    "df_results = add_result(df_results,\n",
    "                        'Pickle (ZIP)',\n",
    "                        result_read.average)\n",
    "\n",
    "df_results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "ac1e61c7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2.24 s ± 144 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>method</th>\n",
       "      <th>read_time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CSV</td>\n",
       "      <td>10.281532</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>CSV No Index</td>\n",
       "      <td>7.362340</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>CSV No Index (GZIP)</td>\n",
       "      <td>2.472006</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>CSV No Index (BZ2)</td>\n",
       "      <td>3.592286</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Pickle (GZIP)</td>\n",
       "      <td>2.319007</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Pickle (BZ2)</td>\n",
       "      <td>2.821927</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Pickle (ZIP)</td>\n",
       "      <td>2.623271</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Pickle (xz)</td>\n",
       "      <td>2.240822</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                method  read_time\n",
       "0                  CSV  10.281532\n",
       "1         CSV No Index   7.362340\n",
       "2  CSV No Index (GZIP)   2.472006\n",
       "3   CSV No Index (BZ2)   3.592286\n",
       "4        Pickle (GZIP)   2.319007\n",
       "5         Pickle (BZ2)   2.821927\n",
       "6         Pickle (ZIP)   2.623271\n",
       "7          Pickle (xz)   2.240822"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#---load---\n",
    "result_read = %timeit -n5 -r5 -o pd.read_pickle(repository_path + filename + '_xz.pkl', \\\n",
    "                                                compression='xz')\n",
    "#---save the result to the dataframe---\n",
    "df_results = add_result(df_results,\n",
    "                        'Pickle (xz)',\n",
    "                        result_read.average)\n",
    "\n",
    "df_results"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "61172844",
   "metadata": {},
   "source": [
    "## Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "41648489",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>method</th>\n",
       "      <th>read_time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Pickle (xz)</td>\n",
       "      <td>2.240822</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Pickle (GZIP)</td>\n",
       "      <td>2.319007</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>CSV No Index (GZIP)</td>\n",
       "      <td>2.472006</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Pickle (ZIP)</td>\n",
       "      <td>2.623271</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Pickle (BZ2)</td>\n",
       "      <td>2.821927</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>CSV No Index (BZ2)</td>\n",
       "      <td>3.592286</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>CSV No Index</td>\n",
       "      <td>7.362340</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CSV</td>\n",
       "      <td>10.281532</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                method  read_time\n",
       "7          Pickle (xz)   2.240822\n",
       "4        Pickle (GZIP)   2.319007\n",
       "2  CSV No Index (GZIP)   2.472006\n",
       "6         Pickle (ZIP)   2.623271\n",
       "5         Pickle (BZ2)   2.821927\n",
       "3   CSV No Index (BZ2)   3.592286\n",
       "1         CSV No Index   7.362340\n",
       "0                  CSV  10.281532"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_results.sort_values(by=\"read_time\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "442f41bb",
   "metadata": {},
   "source": [
    "The conclusion is that **Pickle (xz)** results in the most efficient loading. However, to solve the vulnerability issue that arrises with loading with Pickles I've decided to take the next best thing, this is the **CSV BZ2** option which is about the same in terms of loading."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  },
  "vscode": {
   "interpreter": {
    "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
